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Abstract 

Background: SARS coronavirus (SARS-CoV) was identified as the etiological agent of SARS, and 
extensive investigations indicated that it originated from an animal source (probably bats) and was 
recently introduced into the human population via wildlife animals from wet markets in southern 
China. Previous studies revealed that the spike (S) protein of SARS had experienced adaptive 
evolution, but whether other functional proteins of SARS have undergone adaptive evolution is not 
known. 


Results: We employed several methods to investigate selective pressure among different SARS- 
CoV groups representing different epidemic periods and hosts. Our results suggest that most 
functional proteins of SARS-CoV have experienced a stepwise adaptive evolutionary pathway. 
Similar to previous studies, the spike protein underwent strong positive selection in the early and 
middle phases, and became stabilized in the late phase. In addition, the replicase experienced 
positive selection only in human patients, whereas assembly proteins experienced positive 
selection mainly in the middle and late phases. No positive selection was found in any proteins of 
bat SARS-like-CoV. Furthermore, specific amino acid sites that may be the targets of positive 
selection in each group are identified. 


Conclusion: This extensive evolutionary analysis revealed the stepwise evolution of different 
functional proteins of SARS-CoVs at different epidemic stages and different hosts. These results 
support the hypothesis that SARS-CoV originated from bats and that the spill over into civets and 
humans were more recent events. 


Background continents within 3-4 months [1]. Soon after its first out- 
Severe acute respiratory syndrome (SARS) emerged in break, the etiological agent of SARS was identified as a 
Guangdong province of China in November 2002 and _ _ novel coronavirus [2-4], and its complete genome 
subsequently spread rapidly to 25 countries across five | sequence was determined [3,5,6]. The identification of 
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SARS-CoV in Himalayan palm civets and raccoon dogs in 
live animal markets in Guangdong, China, provided the 
first clue of an animal-to-human transmission [7,8]. Fur- 
ther studies indicated that civets were unlikely to be the 
natural reservoir [9]. Instead the detection of different 
SARS-like-CoVs in horseshoe bats (Rhinolophus spp.) 
seemed to suggest that bats might be the natural reservoir 
of SARS-CoV and many other closely related coronavi- 
ruses [10-13]. 


Like other coronaviruses, SARS-CoV is an enveloped, pos- 
itive-stranded RNA virus with a genome of approximately 
29,700 nucleotides. The genome contains at least 14 open 
reading frames (ORFs) that encode 28 proteins in three 
distinct classes: two large polyproteins Pla and Plab that 
are cleaved into 16 non-structural proteins (nsp1-nsp16) 
during viral RNA synthesis; four structural proteins (S, E, 
M and N) that are essential for viral entry and assembly; 
and eight accessory proteins that are believed to be non- 
essential for viral replication, but may facilitate viral 
assembly and play a role in viral virulence and pathogen- 
esis (Figure 1) [14-17]. 


Similar to all RNA viruses, SARS-CoV replication is associ- 
ated with genomic and antigenic variation. The @ ratio 
(dy/ds ratio of non-synonymous to synonymous substitu- 
tions) can measure the selective pressure at protein level, 
with @ = 1, < 1, > 1 indicating neutral selection, negative 
selection and positive selection, respectively [18,19]. Pre- 
vious studies have suggested that the S protein of SARS- 
CoV experienced positive selection during SARS epidemic 
[20-22]. However, these studies did not find or did not 
analyze for positive selection among the replicase or 
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accessory proteins, which may be equally important for 
SARS-CoV's adaptation to a new host. In order to system- 
atically investigate the adaptive evolutionary process of 
SARS-CoVs, we employed the branch-site model to ana- 
lyze the selective pressures that may act upon some key 
SARS-CoV functional proteins involved in virus entry, rep- 
lication and assembly. Our results suggest that diversified 
selective forces act upon different proteins and during dif- 
ferent epidemic phases. 


Methods 

Sequence data 

A total of 156 sequences of SARS-CoVs or bat SARS-like- 
CoVs were retrieved from GenBank (129 complete 
genomes and 27 partial genomes) (see additional file 1). 
Based on these sequences, three datasets were constructed. 
Dataset 1 contains all Spike genes. Dataset 2 is a merged 
dataset that includes sequences of 4 main replicase 
domains of SARS-CoV: papain-like protease (PLpro), 3C- 
like protease (3CLpro), RNA dependent RNA polymerase 
(RdRp) and Helicase (Hel). Dataset 3 is a merged dataset 
that includes sequences of 7 ORFs: ORF3a, E, M, ORF6, 
ORF7a, ORF7b and N genes. 


These protein-coding sequences are aligned based on 
translated protein sequences using Clustal W program 
implemented in BioEdit [23,24]. Prior to analysis all 
sequences that were identical to another within the data- 
set were removed, since previous studies have shown to 
have little effect on the detection of positive selection and 
contribute little evolutionary information [25]. Align- 
ment gaps were manually removed based on the reference 
sequence of 31-HPO3L_Tor2 (NC_004718). 
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The final composition of each dataset is as follows: dataset 
1 contains 3765 bp of 59 S gene sequences; dataset 2 
includes 35 sequences of replicase domains, 6435 bp in 
total (945 bp for PLpro, 918 bp for 3CLpro, 2769 bp for 
RdRp, 1803 bp for Hel) [17,26-28]; and dataset 3 contains 
56 combined sequences, 3666 bp in total (822 bp for 
ORF3a, 228 bp for E, 663 bp for M, 189 bp for ORF6, 366 
bp for ORF7a, 132 bp for ORF7b and 1266 bp for N). 


Phylogenetic analysis and reclassification of SARS-CoVs 
For each dataset, a phylogenetic tree was built with MrBayes 
3.1.2 (1,000,000 generations, sampled every 100 genera- 
tions, burnin = 500, 4 chains) [29]. The tree topologies pre- 
sented in figures 2, 3, 4 were used for different models. In 
previous studies, SARS-CoV isolates have been divided into 
five groups: 02-03 palm civets, 02-03 early, middle, late 
human patients, and 03-04 civet and human [20,21]. In 
the current study, we included an additional group contain- 
ing the bat SARS-like-CoVs. Based on tree topologies and 
epidemiological information, we reclassified each dataset, 
such as to enable us to realistically investigate the adaptive 
evolution of SARS-CoVs in different hosts and during dif- 
ferent epidemic periods. As showed in figures 2, 3, 4, the 
following groups were established: the BSL group, repre- 
senting bat SARS-like-CoVs; the PC03 group, representing 
isolates from palm civets in 2003; the HPEM group, repre- 
senting human patient isolates during early and middle 
epidemic phases in 2002-03; the HPL group, representing 
human patient isolates during late epidemic phase in 2003; 
the PCHP04 group, representing civet and human 
sequences from the 2003-04 epidemic phase; the HP03 
group, representing all isolates collected from human 
patients during the epidemic period of 2002-03; and the 
HPML group, representing human patient isolates collected 
during the middle and late epidemic phases in 2003; and 
finally, the SARS group, representing all isolates collected 
from civets and human patients in 2002-04. 


Detection of recombination and positive selection 

Since recombination can influence the detection of posi- 
tive selection, we first tested for recombination in our data 
sets by using a genetic algorithm for recombination detec- 
tion (GARD) [30]. Identified breakpoints by GARD were 
then assessed for significance by using Kishino-Hasegawa 
(KH) test in HYPHY package [31]. Since most sequences 
in SARS group have high similarity and increasing the 
number of sequences may dilute the signal, for each data- 
set, we choose 10 sequences for GARD analysis (four from 
BSL group: 124-Bat_SARS-273, 125-Bat_SARS-279, 126- 
Bat_SARS-HKU3, 127-Bat_SARS-Rp3; six from SARS 
group: 3-HPO3E_GZ02, 15-HP03M_BJO2, 31- 
HPO3L_Tor2, 106-HP04_GZ0402, 110-PC04_PC4-136, 
130-PC03_SZ13). 


To test for diversifying selection and to infer codon sites 
under positive selection, we mainly used CODEML pro- 
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gram in the PAML 4.1 software package, which is based on 
the maximum likelihood algorithm of Yang and cowork- 
ers [32]. Three kinds of models (branch-specific, site-spe- 
cific and branch-site) were employed to detect selective 
pressure among different branches and at different sites. 
The likelihood ratio test (LRT) was used to investigate 
whether the null hypothesis, where no positive selection 
is allowed, can be rejected against the alternative hypoth- 
esis, where positive selection is allowed [32]. The one 
ratio model (MO) assumes the same @ ratio for all 
branches and sites in the phylogeny. The free-ratio (FR) 
model assumes an independent o ratio for each branch in 
the phylogeny. MO and FR can be compared using LRT to 
examine whether ratios are different among lineages. 
The discrete model (M3) estimates w for three classes of 
codon. Comparing MO and M3 can test the variability of 
selective pressure among sites. When evidence for positive 
selection (@ > 1) was detected, the naive empirical Bayes 
(NEB) method was used to calculate posterior probabili- 
ties for site classes. A higher posterior probability suggests 
strong support for a site to be under positive selection. In 
brief, the branch-specific model assumes variation among 
branches, but not among sites; the site-specific model 
assumes variation of selective pressure among sites, but 
not among branches. Both models are widely used to 
investigate selective pressure. However, if adaptive evolu- 
tion occurs at a few time points and affects a few amino 
acids, these two models might lack power in detecting 
positive selection. To overcome this limitation, we also 
used branch-site model, which assumes that the o ratio 
varies both among sites and among lineages [33,34]. In 
the branch-site model A (model A), the lineages of interest 
are set to be foreground, and the other lineages to be back- 
ground. Selective constrains are assumed to vary across 
sites both along foreground and background, and a small 
fraction of sites only vary along foreground lineages. 
There are 3 w ratios for foreground (0 < @)< 1, @, = 1, a 
> 1) and 2 o ratios for background (0 < @)< 1, @, = 1) in 
branch-site model A. When evidence for positive selection 
(@ > 1) was detected, both naive empirical Bayes and 
Bayes empirical Bayes (BEB) were used to calculate poste- 
tior probabilities for site classes. Since NEB does not 
account for sampling errors, we used the BEB outputs as 
suggested by Yang [35]. The null model (model A’) is 
same as model A, but w, = 1 is fixed. Branch-site model 
tends to be the most powerful of the three tests. In order 
to investigate the variation of selective pressure in differ- 
ent epidemic periods and hosts, we set each group of 
SARS-CoVs as foreground in turn to implement branch- 
site model. However, in such multiple tests, the probabil- 
ity of false rejection of at least one null hypothesis can be 
high. So we used Bonferroni correction to control the false 
positive rate, as it has been shown to be powerful when 
applied to the branch-site test [36]. As to dataset 1 and 3, 
we applied branch-site model to 6 groups on the tree, and 
for dataset 2, we applied branch-site model to 5 groups. 


Page 3 of 15 


(page number not for citation purposes) 


BMC Evolutionary Biology 2009, 9:52 http://www.biomedcentral.com/1471-2148/9/52 


124-Bat_SARS-273 
126-Bat_SARS-HKU3 
125-Bat_SARS-279 
127-Bat_SARS-Rp3 


BSL 


1001 — 2-PC03_SZ3 


93-HPO3L_Sino1-11 
H00-—- 31-HPO3L_Tor2 
43-HP03L_SH-QXC1 
100 86-HPO3L_Sin3765V 
87-HPO3L_Sin845 


100;- 130-PC03 _SZ13 
131-PC03_SZ1 eee 
3-HPO3E_GZ02 
100 38-HPO3L_GD01 
57 4-HP03E_HGZ8L1-A 
gat 11-HPOSE_ZS-A 
29-HP03M_JMD 
00 #6 15-HP03M_BJ02 

e 16-HPO3M_BJ03 

24-HP03M_HZS2-C 
97-- 7-HPO3E_HSZ-Bb 
96 124 9-HP0O3E_HSZ-Cb 

997- 19-HPO3M_GZ50 
20-HP03M_GZ-A 

17-HP03M_Bu04 
100 27-HP03M_HZS2-Fb 

a 32-HPO03L_AS 
33-HPO3L_Frank 
40-HP03L_GZ-B 

41-HP03L_GZ-C 
99 50-HP03L_CUHK-LC2 de 


106-HP04_GZ0402 SARS 


133-PC04_PC4-115 
134-PC04_PC4-137 
100 105-HP04_GZ0401 
64 136-PC04_PC4-241 
144-PC04_B012 
100] 78 140-PC04_A021 
145-PC04_B024 
121-PC04_HC-SZ-DM1 
110-PC04_PC4-136 
111-PC04_PC4-13 
113-PC04_PC4-199 
135-PC04_PC4-205 
139-PC04_A013 PCHP04 
72 147-PC04_B033 
152-PC04_C018 
155-PC04_C028 
123-CFB04_SZ 
138-PC04_A001 
149-PC04_C013 
63 156-PC04_C029 
120-PC04_HC-SZ-79 
55 108-PC04_A022 
87 146-PC04_B029 
154-PC04_C025 
114-PC04_civet007 


0.001 117-PC04_HC-SZ-61 


Figure 2 
Phylogenetic relationships of 59 S gene sequences of SARS-CoVs from human and animals. The tree was gener- 


ated with MrBayes 3.1.2 program. Posterior probabilities are shown on the nodes of the tree. Branch between BSL group and 
others was depicted with dotted line, because the branch was too long to be displayed at same scale. Bar, 0.001 nucleotide 


substitutions per site. 
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tions per site. 


So we used 0.0083 as the significance level for each 
branch-site tests in dataset 1 and 3, used 0.01 as the signif- 
icance level for dataset 2. As indicated previously by Yang 
[33], these models sometimes fail to converge to maxi- 
mum likelihood estimates. We thus performed each anal- 
ysis at least twice using different starting values. Only 
identical data produced from both runs were considered 
reliable. All data are available upon request. 


In order to examine the robustness of those positive selec- 
tions identified by PAML, we also analyzed our datasets 


using HYPHY package accessed through the Datamonkey 
facility http://www.datamonkey.org[37]. Datamonkey 
includes three methods for detecting sites under selection: 
single likelihood ancestor counting (SLAC), fixed effects 
likelihood (FEL) and random effects likelihood (REL). 
REL method is often the only method that can infer selec- 
tion from small (5-15 sequences) or low divergence 
alignments and tends to be the most powerful of the three 
tests. So this method was run using the GTR substitution 
model on a neighbor-joining phylogenetic tree by the 
Datamonkey web server. In order to investigate selective 
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pressure among different hosts and epidemic phases, we 
split each dataset (S protein, replicase domains, 3'-end 
ORFs) into appropriate groups for analysis. 


Results 

Phylogenetic analysis 

For all genes analyzed, represented by S, replicase and 3'- 
end ORFs gene trees, at least four groups are apparent: BSL, 
PC03, HP03, PCHP04. As to the HP03 group, it can be sub- 
grouped into HPEM and HPL in S gene tree, and HPEM and 
HPML in 3'-end ORF tree. It should be noted that the pos- 
terior probabilities for several nodes are low and there are 
some polytomies. These uncertainties could be due to some 
sequences in SARS group have high amino acid similarity, 
especially for replicase and 3'-end ORFs which are more 
conservative. However, previous studies suggested that the 
LRTs and qualitative results of ML parameter estimation 
were rather insensitive to tree topology [35,38-42]. For 
branch-site model, Bayesian site identification might be 
affected by tree topology [40]. Remarkably, one isolate (38- 
HP0O3L_GD01), which was isolated in the later epidemic 
phase in 2003, always clustered with the early phase iso- 
lates. A possible explanation could be that this patient was 
infected in the early epidemic phase, which is supported by 
sequence analysis; this isolate has 29 extra nucleotides 
between ORF8a and ORF8b, a feature shared among iso- 
lates from civets and early phase patients. Another isolate 
(5-HP03E_HGZ8L1-B), which was isolated in the early epi- 
demic phase, tends to cluster with the middle phase isolates 
(Fig 2 and 4). This virus may be a transitional virus because 
it does not have the extra 29 nucleotides like the middle 
phase isolates. 
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Detection of recombination 

As showed in table 1 and additional file 2, GARD detected 
9 breakpoints in dataset 1, and KH test indicated that 1 
breakpoint (2301) was significant at p-value < 0.01 level. 
For dataset 2 and 3, GARD detected 2 and 5 breakpoints 
respectively, but none of them was significant after KH 
test. 


Positive selection on the S protein of SARS-CoV 

We analyzed the entire S gene of 156 isolates. Because sev- 
eral isolates were identical at the amino acid sequence 
level, we eliminated them from the dataset since previous 
analyses indicated that contribute limited evolutionary 
information [19,25]. Therefore, 59 sequences were com- 
piled into dataset 1. Table 2 presents the analysis results of 
dataset 1. The analyses of branch-specific model (FR) 
indicate that selective pressure varied along branches. 
Many branches in the HPEM and PCHP04 groups clearly 
have higher ratios. The LRT statistic for comparing MO 
and FR is significant, which confirm the heterogeneous 
selective pressure along branches. According to the site- 
specific model (M3), 1.3% sites among S protein are 
under positive selection with = 3.214. Furthermore, this 
model identifies 9 sites under positive selection at poste- 
rior probability p > 90% level (Table 2). All these sites are 
distributed within the $1 domain. 


The results of branch-site model revealed that no evidence 
of positive selection was found in the BSL, PCO3 and HPL 
groups. For the HPEM group, the results indicated that 
3.2% sites of S gene are subjected to strong positive selec- 
tion with o = 28.756. At p > 90% level, 14 specific sites 


Table |: KH tests verify the significance of breakpoints estimated by GARD analysis 


p-value 
Dataset Number of breakpoints AIC, improvement Breakpoint location LHS RHS 
Spike 9 588.485 776 1.000 0.220 
933 0.059 0.427 
1257 1.000 0.464 
1485 1.000 1.000 
2067 1.000 0.670 
2301 0.002 0.002 
2592 0.893 0.061 
2916 0.085 0.988 
3501 1.000 0.659 
Replicase 2 30.414 1230 1.000 0.507 
4398 1.000 0.040 
3'-end ORFs 5 254.203 454 0.363 0.306 
729 0.210 0.001 
1091 0.010 0.078 
1927 0.254 0.005 
3321 1.000 0.353 


KH test was used in both directions to compare phylogenies constructed from the alignment segment to the left hand side (LHS) and right hand 
side (RHS) of each estimated breakpoint. All p-values have been adjusted by Bonferroni correction. 
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Table 2: Maximum likelihood (ML) estimates for 59 S genes of SARS-CoV 


Models df. Parameters Parameters InLo 2Al P-value _ Positively 
under under (InL;) selected 
null alternative sites* 
model model 
Branch Model 114 MO (one ratio) Free Ratio -12834.110 355.006 <0.00! Not allowed 
MO vs. FR @ = 0.081 @ = O~co (-12656.604) 
Site Model 4 MO (one ratio) M3 (discrete, K = 3) -12834.110 436.204 <0.001 142,311, 430, 
MO vs. M3 @ = 0.081 Po = 0.732, Wp = 0.015 (-12616.008) 462, 479, 540 
p, = 0.255, , = 0.285 609, 626, 665 
pz = 0.013, o ,= 3.214 
Branch-site model A 
BSL group | MA' (fix @= 1) MA -12661.687 0 1.000 None 
as foreground Po = 0.912, Wp = 0.047 = py = 0.912, wy = 0.047 (-12661.687) 
MA' vs. MA p, = 0.088 p, = 0.088, w, = 1.000 
(Prt Po, = 0) (P2.*P2p = 0) 
PCO3 group | MA' (fix @= 1) MA -12658.258 0.024 0.877 None 
as foreground Po = 0.707, &p = 0.046 py = 0.782, wo = 0.046 (-12658.246) 
MA' vs. MA p, = 0.069 p, = 0.076, m »= 1.592 
(PrtPr» = 0.224) (PastPr = 0.142) 
HPEM group | MA' (fix @= 1) MA -12646.115 15.572 <0.001 49, 75, 344, 
as foreground Po = 0.587, ®p = 0.044 — py = 0.885, wy = 0.045 (-12638.329) 360, 501, 778 
MA' vs. MA p, = 0.055 p, = 0.083, w »= 28.756 794, 860, 861 
(Prat Pop = 0.358) (Poat Pop = 0.032) 1001, 1148, 1163 
1179, 124 
HPL group | MA' (fix @ = 1) MA -12650.732 0 1.000 
as foreground Po= 0.400, wy = 0.045 — py = 0.400, wo = 0.045 (-12650.732) 
MA' vs. MA p= 0.038 p= 0.038, w, = 1.000 
(Prat P2, = 0.562) (P2.*P2p = 0.562) 
PCHP04 group | MA' (fix @= 1) MA -12626,601 113.802 <0.001 78,91, 108, 
as foreground Po= 9.718, Wy = 0.045 = po= 0.901, p= 0.045 (-12569.700) 113, 147, 227, 
MA' vs. MA p= 0.057 p= 0.072, o »= 57.205 243, 425, 440, 
(Prat Pop = 0.225) (Prat Pop = 0.027) 462, 479, 609, 
613, 632, 743, 
765, 839, 844, 
856, 900, 1052 
1080 
SARS group | MA' (fix @= 1) MA -12498.107 19.274 <0.001 2,7,9, 12, 14, 20, 
as foreground Po= 0.753, p= 0.024 — pg= 0.792, wo= 0.027 (-12488.470) 27, 33, 37, 43, 58, 68, 
MA' vs. MA p,= 0.035 p= 0.034,  .= 1.989 70, 75, 84, 107, 108, 


(Poat Pr» = 0.212) 


(PoatPap = 0.174) 


131, 134, 137, 139 
147, 151, 154, 163, 165, 
167, 169, 174, 199, 201, 
214, 227, 230, 237, 239 
242, 243, 244, 248, 249, 
294, 333, 336, 344, 353, 
391, 392, 426, 431, 

440, 442, 457, 459, 462, 
479, 480, 487, 488, 494, 607, 
613, 644, 729, 732, 743, 
754, 758, 765, 778, 

1052, 1080, 1148, 1163 


* Positively selected sites are identified with posterior probability p > 90%. In boldface, p > 95%. 


were identified as potentially under positive selection 
(Table 2). For the PCHP04 group, 2.7% codon sites of the 
S gene are driven by strong positive selection with @ = 
57.205. Twenty two positively selected sites were identi- 
fied in this group (p > 90%). Fourteen of them are in S1 
and eight in S2 domain. For the selection of entire SARS- 
CoVs from the two epidemics, the branch-site model A 
analyses indicated that 17.4% sites are under positive 
selection with @ = 1.989. A total of 74 sites were identified 


as potentially under positive selection along these line- 
ages at 90% cutoff. In order to intuitively represent the 
distribution of these positively selected sites, we con- 
structed the additional file 3, from which we can find that 
most of these sites distribute in $1 domain. 


HYPHY package analysis accessed through Datamonkey 
facility also detected positive selection in HPEM and 
PCHP04 groups (with dy-dy = 0.061 and 0.938 respec- 
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tively), but not in BSL, PCO3 and HPL groups. As indi- 
cated in table 3, we also identified some positively 
selected sites, most of which are identical to those identi- 
fied by the branch-site model A. 


Positive selection on replicase domains of SARS-CoV 
PLpro, 3CLpro, RdRp and Hel are the major domains for coro- 
navitus replication [43,44]. We merged these four domains 
into one supergene for analysis because: 1) Yang et al. reported 
that gene concatenating analysis produced same outcomes as 
those obtained from analysis of separate genes [42]; 2) sepa- 
rate analysis results in mechanical repeats; 3) concatenating 
analysis can provide additional information because of addi- 
tional number of sequences for the merged dataset, compared 
to separate dataset. Therefore, dataset 2 consists of 35 concate- 
nated sequences from 129 complete genomes. 


As presented in table 4, the results of branch model analysis 
reveal that the @ ratio varies from 0 to infinite along differ- 
ent branches. This implies that selective pressures among 
these domains vary in different hosts and at different epi- 
demic phases, though these domains are the most con- 
served regions in CoV. Analysis using the discrete model 
(M3) detected no sign of positive selection in the dataset 2, 
although it suggests that the @ ratios vary significantly 
among different amino acid sites as indicated by LRT. 


Table 3: REL analysis results for three datasets 


http://www.biomedcentral.com/1471-2148/9/52 


Utilizing the branch-site model A analysis indicated that 
there is no positive selection in the BSL, PC0O3 and 
PCHP04 groups. However, the model A analysis revealed 
that among HP03 group about 8.1% codon sites of these 
4 domains are potentially under strong positive selection 
with @ = 11.093 and 28 sites were identified (7 in PLpro, 
5 in 3CLpro, 7 in RdRp, 9 in HEL). Weak positive selection 
(dy-ds = 0.001) was also detected from HP03 group by 
using HYPHY package but not other groups (Table 3). 


Positive selection on 3'-end ORFs of SARS-CoV 

The 3'-end of SARS-CoV genome encodes 11 ORFs: 
ORF3a, ORF3b, ORF4 (E), ORF5 (M), ORF6, ORF7a, 
OREF7b, ORF8a, ORF8b, ORF9a (N), and ORF9b. The E, 
M, N proteins are structural proteins of SARS-CoV and the 
other proteins are accessory proteins. Because the coding 
regions of ORF3b and ORF9b overlap partially or com- 
pletely with those of ORF3a and N, we excluded the 
ORF3b and the ORF9b from this analysis. The ORF8a and 
ORF8b are present as two separate ORFs in most human 
isolates but as a single ORF (ORF8ab) in isolates from ani- 
mals and early phase human due to the presence of extra 
29 nt in this region, thus resulting in the fusion of ORF8a 
and ORF8b. Because of the difficulty in obtaining a relia- 
ble alignment in this region, ORF8 (a, b or ab) was 
excluded from our analysis as well. For similar reasons as 


Groups? No. of sequences? Mean dy-d<& Positively selected site(s) 
Spike 
BAT 4 -0.957 
PCO3 3 -0.904 
HPEM 14 0.061 49, 75, 77, 144, 239, 244, 311, 344, 778, 860, 861, 1001, 1148, 1163, 1179, 1247 
HPL I -0.138 
PCHP04 27 0.938 
SARS» 40 0.361 75, 147, 227, 239, 243, 244, 311, 462, 479, 609, 613, 743, 765, 778, 1080, 1163 
Replicase 
BAT 4 -0.985 
HPO03 21 0.008 654 
PCHP04 8 -0.774 
SARS 31 -0.561 
3'-end ORFs 
BAT 4 -0.91 
HPEM 5 -0.42 
HPML 33 0.152 
PCHP04 12 -0.571 
SARS* 40 -0.301 


a. At least 3 sequences are needed for REL analysis, so PCO3 groups of dataset 2 and 3 were not analyzed. 

b. The upper limit in number of sequences for REL test is 40, so 15 sequences were removed from original SARS group (removed sequences' 
number: 33, 40, 43, 50, 86, 108, 110, II 1, 123, 135, 139, 144, 147, 152, 156) 

c. 12 sequences were removed from original SARS group (removed sequences’ number: 5, 19, 50, 56, 57, 81, 82, 88, 91, 103, 107, 111) 

d. As a rule of thumb, at least 10 sequences are needed to detect selection at a signal site with reliability. So some of the results may be not reliable 


because of not enough sequences are available for some groups. 


e. Because d, could be 0 for some sites, Datamonkey reports dy-ds in place of dy/ds 
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Table 4: Maximum likelihood (ML) estimates for 35 merged replicase genes of SARS-CoV 


Models d.f. Parameters under Parameters InLo 2Al P-value _ Positively 
null under (InL;) selected 
model alternative sites* 
model 
Branch Model 66 MO (one ratio) Free Ratio -14460.634 212.17 <0.001 Not allowed 
MO vs. FR @ = 0.024 @ = 0~co (-14354.549) 
Site Model 4 MO (one ratio) M3 (discrete, K = 3) -14460.634 26.866 <0.001 None 
MO vs. M3 @ = 0.024 Po= 0, Wp= 0 (-14447.201) 
p\= 0.972, w,= 0.016 
(p= 0.028), w, = 0.360 
Branch-site model A 
BSL group | MA’ (fix @.= 1) MA -14449.205 0 1.000 None 
as foreground Po = 0.993, ®) = 0.020 py = 0.993, Wy = 0.020 (-14449. 205) 
MA' vs. MA p, = 0.007 p, = 0.007, w, = 1.000 
(P2.*Pap = 0) (P2.*P2p = 0) 
PCO3 group I MA' (fix @2 = 1) MA -14449.205 0 1.000 None 
as foreground Po = 0.993, @) = 0.020 py = 0.993, Wy = 0.020 (-14449. 205) 
MA' vs. MA p, = 0.007 p, = 0.007, w, = 1.000 
(P2a+P2p = 9) (P2a+P2 = 9) 
HPO3 group | MA’ (fix @= 1) MA -14389.596 6.948 0.008 23, 123, 222, 236, 237, 250, 
as foreground Po = 0.322, j= 0.015 — po= 0.913, wo= 0.015 (-14386.122) 266, 375, 377, 409, 504, 563, 
MA' vs. MA p, = 0.002 p= 0.006, w = 11.093 646, 654, 884, 1234, 1259, 1482, 
(Pz,+P2p = 0.676) (P2,*+P2 = 0.081) 1491, 1786, 1866, 1869, 1878, 
1963, 1995, 2010, 2032, 2034 
PCHP04 group | MA’ (fix = 1) MA -14435.921 0 1.000 None 
as foreground Po= 0.761, p= 0.018 — po= 0.760, w= 0.018 (-14435.921) 
MA' vs. MA p= 0.006 p= 0.006, w, = 1.000 
(P2.*Pop = 0.234) (P2.*P2, = 0.234) 
SARS group I MA’ (fix @.= 1) MA -14405.997 0.006 0.938 
as foreground Po = 0.850, ®) = 0.012 py = 0.857, wy = 0.012 (-14405.994) 
MA' vs. MA p, = 0.005 p, = 0.005, w= 1.061 


(PoatPap = 0.145) (Doa*Pop = 0.138) 


* Positively selected sites are identified with posterior probability p > 90%. In boldface, p > 95%. 


mentioned above, we merged the 7 remaining ORFs into 
a supergene for analysis. 


As presented in table 5, the results of FR model analysis 
revealed that selective pressures vary among lineages. The 
results of M3 model also implied variation in selective 
pressure among different amino acid sites. However, the 
M3 model did not detect any sign of positive selection. 
The results of branch-site model A revealed that, except for 
the BSL, PCO3 and HPEM groups, the other groups dis- 
played positive selection signatures (Table 5). For the 
HPML group, about 12.2% sites of these ORFs were 
shown to be under positive selection with w = 9.863. 
Twenty five specific sites were identified: 6 in ORF3a (11, 
29, 85 129, 136, 222); 4 in E (279, 280, 304, 319); 9 inM 
(377, 388, 418, 423, 436, 449, 463, 469, 504); 1 in ORF6 
(584); 1 in ORF7a (696); and 4 in N (850, 932, 954, 993). 
When the PCHP04 group was defined as foreground, the 
branch-site model A analysis estimated that 1.9% sites 
were under positive selection with m = 22.447 and four 
sites were identified to be under positive selection (2 in 
ORF3a, 1 in ORF6, 1 in N). For the whole SARS-CoV col- 
lection, the branch-site model A analysis revealed 12.2% 


sites of these ORFs to be under positive selection with @ = 
3.138. A total of 17 sites were identified at p > 90% level. 
Among them, 9 are located in ORF3a, 3 in M, 2 in ORF6 
and 3 in N. In addition, a large number of sites were iden- 
tified to be potentially under positive selection at p > 70% 
level (see additional file 3B). 


Discussion 

Natural selection generally leads to a reduction in delete- 
rious mutations while promoting advantageous muta- 
tions. If a gene is highly divergent, there are two main 
explanations: first, it may be due to high mutation rate or 
relaxed selective constraint, in which case the gene may be 
free to mutate mainly because it has no fitness or function; 
or second, due to positive selection which is promoted by 
natural selection and the gene usually has highly impor- 
tant functions [45]. Virus entry, replication, assembly and 
release are the main steps of viral life cycle. Proteins 
involved in each of these steps may undergo adaptive evo- 
lution after a virus invades a new host. 


Recombination and mutation are two important evolu- 


tionary mechanisms driving gene diversity and adapta- 
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Table 5: Maximum likelihood (ML) estimates for 56 merged 3'-end ORFs of SARS-CoV 


Models df. Parameters Parameters InLo 2Al P-value _ Positively 
under under (InL;) selected 
null alternative sites* 
model model 
Branch Model 108 MO (one ratio) Free Ratio -9142.692 173.623 <0.001 Not allowed 
MO vs. FR @ = 0.170 @ = 0~co (-9055.881) 
Site Model 4 MO (one ratio) M3 (discrete, K = 3) -9142.692 99.114 <0.001 None 
MO vs. M3 @ = 0.170 Po= 0, @p= 0 (-9093.135) 
p\= 0.866, w,= 0.058 
(p= 0.134),w.= 0.986 
Branch-site model A 
BSL group I MA' (fix @= 1) MA -9093.137 0 1.000 None 
as foreground Po= 0.868, W = 0.059 —po= 0.868, w= 0.059 (-9093.137) 
MA' vs. MA p= 0.132 p= 0.132, w2= 1.000 
(P2.*P2p = 0) (P2*P2p = 0) 
PCO3 group | MA' (fix @= 1) MA -9093.137 0 1.000 None 
as foreground Po= 0.868, Wo= 0.059 — Po= 0.868, w= 0.059 (-9093.137) 
MA' vs. MA p,= 0.132 p= 0.132, w, = 1.000 
(P2a*Pop = 0) (P2a*Pop = 0) 
HPEM group | MA' (fix @= 1) MA -9093.127 0.078 0.780 
as foreground Po= 0.855, Wp= 0.059 — po= 0.861, p= 0.059 (-9093.088) 
MA' vs. MA p= 0.130 p\= 0.131, o »= 4.300 
(P2.*P2p = 9.015) (Pr.*P2p = 0.008) 
HPML group | MA' (fix @= 1) MA -9069.427 8.614 0.003 11, 29, 85, 129, 136, 
as foreground Po= 0.125, Wp= 0.047 — po= 0.772, Wo= 0.046 (-9065.120) 222, 279, 280, 304, 319, 
MA' vs. MA p,= 0.017 p= 0.106, o »= 9.863 377, 388, 418, 423, 436, 
(Prat Pop = 0.858) (Prat Pop = 0.122) 449, 463, 469, 504, 584, 
696, 850, 932, 954, 993 
PCHP04 group | MA' (fix @= 1) MA -9087.427 22.502 <0.001 25, 259, 609, 1184 
as foreground Po= 0.690, W p= 0.055 — po= 0.862, wy = 0.057 (-9076.176) 
MA' vs. MA p= 0.097 p)= 0.119, @ 9= 22.447 
(Pr.*Pop = 9.213) (P2a*Pop = 0.019) 
SARS group | MA' (fix @ = 1) MA -9058.932 15.402 <0.00! II, 15, 81,117, 120, 121, 
as foreground Po = 0.664, Wo= 0.033 Po= 0.804, w= 0.037 (-9051.231) 171, 193, 259, 355, 361, 560, 
MA' vs. MA p= 0.066 p= 0.074,  ,= 3.138 609, 628, 830, 850, 1184 


(Prat Pr = 0.270) 


(Prat Pr = 0.122) 


* Positively selected sites were identified with posterior probability p > 90%. In boldface, p > 95%. 


tion. Since recombination can affect the detection of 
positive selection, we first tested for recombination in our 
datasets [46]. GARD detected no evidence of recombina- 
tion within the replicase and 3'-end ORFs, while one puta- 
tive breakpoint in spike protein was detected. Whether 
there is any recombination among SARS-CoV is still 
debatable [13,22,47-49]. Previous studies suggested puta- 
tive recombination only when analysis of SARS-CoV 
sequences were put together with other coronaviruses [47- 
49]; however, when analyses were focused solely on SARS- 
CoV sequences, recombination could not be detected 
[13,22]. We also tested recombination in SARS group 
alone for each dataset. No evidence of recombination was 
detected by GARD. These results might imply that there 
could be some ancient recombination events occurred 
between bat SARS-like-CoV and the ancestor of SARS- 
CoV, which drove the bat SARS-like-CoV adaption to civet 
and human. Nevertheless, previous studies had revealed 
that detection of positive selection by LRT method was 
robust to low levels of recombination (with fewer than 


three recombination events), and identification of sites 
under positive selection by the empirical Bayes method 
appeared to be less affected than the LRT by recombina- 
tion [46]. Overall, the issue of recombination among RNA 
viruses is highly controversial because the putative recom- 
bination events described were detected only by utilizing 
computationally-demanding phylogenetic analyses (split 
decomposition and/or maximum likelihood methods). 
Therefore, caution should be used when inferring conclu- 
sions about putative recombination events that are based 
only on such analyses. Because viable clonal recombinant 
viruses have been rarely observed in nature, for natural 
recombination leading to the transmission of a recom- 
binant strain to be conclusively confirmed, the following 
prerequisites should be met: (i) the recombinant crosso- 
ver should be demonstrated in a single PCR amplicon fol- 
lowing cloning to ensure it occurs in a single DNA 
molecule; (ii) the recombination should be demonstrated 
repeatedly in clonal populations of viable virus (e.g. a 
plaque harvest or limited endpoint dilution; and (iii) the 
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recombinant should maintain adequate sequence conser- 
vation during post-recombination evolution [50]. 


The S protein is a structural protein of coronavirus. It has 
a crucial role in the binding of virus to host receptor and 
subsequent fusion between the viral and host mem- 
branes, both processes being important for virus entry 
into host cell. In the case of several mammalian and avian 
coronaviruses, the S protein is cleaved into S1 and S2. The 
former contains receptor attachment sites and the later is 
involved in the fusion of CoV onto host cell. The $1 sub- 
unit, which usually has high divergence, contains a recep- 
tor binding domain (RBD); the S2 subunit, which is 
comparatively more conserved, contains two heptad 
repeat (HR) domains [51]. Several studies revealed that 
the S gene of SARS experienced noticeable positive selec- 
tion during the SARS epidemics, especially in the early 
and middle phases [20-22]. However, our analyses indi- 
cated that the S protein of SARS-CoV underwent a step- 
wise adaptive process subsequent to its spillover into the 
civet and human populations. In the BSL group, our anal- 
yses suggest that the bat SARS-like-CoVs experienced puri- 
fying selection, indicating that the S gene is relatively 
stable in bats. In palm civet, SARS-CoV experienced strong 
positive selection as indicated by the results of PCHP04 
group. The failure to identify significant positive selection 
in PCO3 group was most likely due to the limited number 
of sequences available for analyses (only four sequences 
for PCO3 group and two of them have identical S gene 
sequences). During the early and middle phases of the 
2002-2003 SARS epidemic in human population, a small 
fraction of sites among the S protein were under strong 
positive selection. In contrast, isolates from the late phase 
showed no sign of positive selection, implying the S pro- 
tein became stable again after two stages of adaptive evo- 
lution. Our analysis using the HYPHY package accessed 
through Datamonkey facility also revealed similar evolu- 
tionary patterns for the S protein (Table 3). Taken 
together, these results support the hypothesis that SARS- 
CoVs originated in bats, that the spill over into civets and 
humans were recent events and that the two SARS epi- 
demics that took place one year apart, were results of inde- 
pendent animal-to-human transmissions [7,20]. The 
major sequence difference of the S genes between bat 
SARS-like-CoVs and civet/human SARS-CoVs suggests 
that there may be other more closely related SARS-CoVs in 
bats or there may be other unknown intermediate animal 
host(s) in the transmission of the virus(es) from bats to 
civets [49]. 


Among the SARS-CoVs from human and palm civets, 
numerous sites are identified to be potentially under pos- 
itive selection (Table 2 and see additional file 3A). Those 
sites inferred from different SARS epidemic phases reflect 
the adaptation process of SARS-CoVs. Those sites identi- 
fied from the PCHP04 group may be important for SARS- 
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CoV adaptation to palm civets. Those sites identified from 
the HPEM group may be important for SARS-CoV adapta- 
tion to human. Most of these sites, especially from the 
entire SARS group, are located in the $1 domain that con- 
tains the receptor binding domain. Zhang et al. [22] pre- 
viously identified 12 positively selected sites in the SARS- 
CoV group, all of which were confirmed in our current 
study. The greater number of sites identified in our study 
is likely due to the fact that the branch-site model is more 
powerful than the site-specific model. Some of these sites 
have been confirmed by experimental data to be crucial 
for the adaption of SARS-CoVs to human. For example, it 
has been found that adaptation of S protein to human 
angiotensin converting enzyme 2 (ACE2) is facilitated by 
alteration of residue 479 to asparagine and of 487 to thre- 
onine [52,53]. Also, using site-directed mutagenesis, Zhu 
and Chakraborti identified that residues 344, 392, 426, 
431, 479, 480, 487, 488 and 494 are important for the 
binding of RBD with ACE2 and SARS-CoV antibody 
[54,55]. In RBD, there are eight newly identified sites 
(333, 336, 353, 391, 440, 442, 457, 459, and 462) which 
have not been proved to be critical for RBD and ACE2 
interaction. Furthermore, there are ~G60 sites to be under 
positive selection beside RBD. Although evaluation of 
every observed site under positive selection by reverse 
genetics would not be realistic or feasible, generation and 
evaluation of mutant viruses based on sites located within 
or adjacent to functional domains could provide clues for 
the genetic aetiology of SARS adaptation to new hosts and 
emergence. 


The first two thirds of coronavirus genome encode two 
large polyproteins: Pla and Plab, which are cleaved by 
virus-encoded proteinases (PLpro and 3CLpro) into 16 
non-structure proteins (nsp1-nsp16) playing important 
role during coronavirus replication. Because the Plab is 
too big (~21 kb), we analyzed four most important 
domains related to viral replication: PLpro, 3CLpro, RdRp 
and Hel, which correspond to nsp3, nsp5, nsp12 and 
nsp13, respectively [17,43,44]. Our results revealed that, 
unlike the adaptive evolutionary pattern of S protein, 
these replicase domains did not experience positive selec- 
tion in bat and palm civet, but underwent strong positive 
selection in human patient. Moreover the evidence of pos- 
itive selection is stronger in the later phases than that 
observed in the early and middle phases (data no shown). 
Furthermore, our analysis using HYPHY package observed 
very weak positive selection in the HP03 group but not 
among the other groups (Table 3). These results to some 
extent differ from the observations derived from several 
previous studies. Using pairwise analysis of the dy/dg, 
Song et al. found that the average w ratio of S protein for 
the early phase was larger than that for the middle phase, 
which in turn was larger than the ratio for the late phase. 
A similar pattern was found in ORFla and ORF3a but not 
in ORF1b and nsp3, which were suggested under purify- 
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ing selection during the whole course of the epidemic. 
They identified over 200 single-nucleotide variations 
(SNVs) and inferred the importance of some SNVs on 
SARS-CoV adaptive evolution [21]. Zhang et al. also inves- 
tigated the adaptive evolution of the S protein employing 
the site specific model, but they did not observe any pos- 
itive selection in RdRp, Hel or nsp3 [21,22]. The best 
explanation for this apparent discrepancy is that the meth- 
ods used in their studies were more conservative than the 
branch-site model used in our study, and hence were not 
able to detect positively selected changes among the 
highly conserved genes. Alternatively, it may be due to the 
use of concatenating analysis in our study, which can pro- 
vide additional information due to the compiling of more 
dissimilar sequences for datasets. For the replicase genes 
of bat and civet isolates, there was no sign of positive 
selection. This is probably due to the following reasons: 1) 
the civet isolates were collected within a very short time 
period and thus there was not enough time for adaptive 
evolution; 2) civet cells are very suitable for SARS-CoV 
replication which may imply that the civet is a perfect 
intermediate host for SARS-CoV; and 3) the bat isolates 
might have completely adapted to their hosts and hence 
were under no further selective pressure for evolution. 


As to the 3'-end ORFs, the most diversifying selection hap- 
pened in the middle and late phases of the SARS epidemic 
in 2003-2004. No positive selection was found in the 
BSL, PCO3 and HPEM groups. When the isolates from two 
epidemics of 2003 and 2004 were investigated together, 
12.2% sites in these 7 ORFs were shown to be under pos- 
itive selection (Table 5). In addition to a few sites identi- 
fied at p > 90% level, many sites are inferred to be under 
positive selection at 70-80% posterior probability (see 
additional file 3B). Most of these sites are located in 
ORF3a, E, M and ORF6, implying these four genes may 
play a more important role for SARS-CoV adaptation in a 
new host. Our results based on the REL method showed 
weak positive selection in HPML group, but not in other 
groups (Table 3). The failure to identify specific sites 
under positive selection could be due to weak and thus 
undetected positive selection. These results suggested that 
the 3'end-ORFs underwent positive selection after SARS- 
CoV spilled out into civet and human populations, and 
adaptive evolution mainly happened in the middle and 
later phases in 2003. Song et al. previously suggested that 
the 3a protein evolved adaptively as S protein [21]. By 
estimating mutation rates, Zhao et al. suggested that the 
non-synonymous substitution rates were comparatively 
high in E, M and N [56]. ORF3a encodes a protein of 274 
amino acids. A recent study indicated that the 3a protein 
forms a potassium sensitive channel that may promote 
virus release and may be important for modulating 
expression of S on the cell surface [16,57]. 3a protein also 
interacts with the structural proteins S, E, M [57,58]. 
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Therefore, amino acid changes in 3a protein might be nec- 
essary to maintain the interaction between 3a and other 
proteins. E and M protein play a pivotal role in viral mor- 
phogenesis, assembly and budding. Co-expression of E 
and M was shown to produce virus-like particles, roughly 
the same size and shape as virions [59]. N protein is 
important for viral packaging which is the first step in the 
assembly of infectious SARS viruses [60]. Thus the amino 
acid changes in these three structural proteins may be crit- 
ical for virus assembly in the new host. SARS-CoV ORE 6 
protein can enhance the virulence of attenuated murine 
coronavirus (MHV) [61], as well as stimulate cellular 
DNA synthesis [62]. A recent study showed that ORF 6 
protein inhibited both interferon synthesis and signaling 
[63]. These findings suggested that ORF 6 may have a role 
in enhancing virus replication or assembly. ORF 7a pro- 
tein inhibits cellular protein synthesis and blocks cell 
cycle progression at GO/G1 phase, suggesting that 7a may 
play important roles in the life cycle of SARS-CoV and the 
pathogenesis induced by SARS-CoV [64,65]. The function 
of other accessory proteins remains to be determined. 
Overall, these findings suggest that the 3'-end ORFs play 
important roles for SARS-CoV replication, assembly and 
release. Collectively, amino acid changes in these proteins 
could play a role in modulating the host switch of SARS- 
CoV. 


Conclusion 

We systematically analyzed the individual SARS-CoV pro- 
teins important for virus entry, replication and assembly. 
The results suggested that SARS-CoVs experienced a step- 
wise adaptation to humans. In palm civets and humans 
during the early and middle epidemic phases, virus entry- 
mediating protein S experienced strong positive selection. 
In contrast, the replicase proteins experienced positive 
selection only in human patients but not in palm civets, 
implying that the palm civet is a suitable intermediate 
host for SARS-CoV replication. The proteins involved in 
virus assembly and release mainly underwent positive 
selection during the middle and later epidemic phases. 
These results highlight the spectacular dynamics of SARS- 
CoV evolution in a narrow time window, period of epi- 
demic, support the zoonotic origin of SARS and suggest 
that some amino acid sites may be critical for viral adap- 
tation in different hosts. Collectively, these results suggest 
that the development of SARS-CoV reverse genetics system 
will facilitate further molecular and/or epidemiological 
investigations to elucidate role of adaptive virus evolution 
in future emergence events. 
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